This document demonstrates the basic process of Maximum Likelihood Estimation (MLE) through manual iteration. We’ll generate a small dataset from a normal distribution, write the log-likelihood function, and then manually optimize it by testing different parameter values to find the maximum, visualizing each step of the process.
Setup and Data Generation
First, let’s set a seed for reproducibility and generate a small sample from a normal distribution:
Code
# Load required libraries
library(ggplot2)
library(dplyr)
library(knitr)
library(gridExtra)
library(gganimate)
library(gifski)
library(transformr) # For smooth transitions in animations

# Define color palette
bucolors <- c("#005A43", "#6CC24A", "#A7DA92", "#BDBEBD", "#000000")

# Set seed for reproducibility
set.seed(123)

# Generate a small dataset from a normal distribution
# True parameters: mean = 10, sd = 2
true_mean <- 10
true_sd <- 2
n <- 20 # sample size
data <- rnorm(n, mean = true_mean, sd = true_sd)

# Summarize the data
summary(data)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.067 9.013 10.240 10.283 11.097 13.574
Let’s visualize our generated data:
Code
# Create data frame for plotting
df <- data.frame(x = data)

# Plot histogram with ggplot
ggplot(df, aes(x = x)) +
  geom_histogram(bins = 10, fill = bucolors[[1]], color = "white", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(x)), color = bucolors[[1]],
             linewidth = 1.2, linetype = "dashed") +
  labs(title = "Histogram of Generated Data",
       subtitle = paste("Sample mean =", round(mean(data), 2)),
       x = "Value",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"))
The Log-Likelihood Function
Looking at our generated data, it appears to be continuous and approximately normally distributed. Let’s formalize our approach to finding the maximum likelihood estimates.
For a normal distribution with parameters \(\mu\) (mean) and \(\sigma\) (standard deviation), the probability density function (PDF) for a single observation \(x\) is:

\[f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]
Note that while \(\mu\) (mean) and \(\sigma\) (standard deviation) are both parameters of the normal distribution, \(\pi\) is the mathematical constant (approximately 3.14159), not a parameter to be estimated. Also, we’ll fix \(\sigma\) at its true value for simplicity and focus on estimating \(\mu\).
For our sample \(x_1, x_2, \ldots, x_n\), assuming independent observations, the joint density function is the product of the individual densities. In other words, the probability of observing all of these data points together, given the parameters, is:

\[L(\mu, \sigma \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\]
Addition is easier than multiplication. Recalling that the log of a product equals the sum of the logs, we can simplify the likelihood function by taking the natural logarithm:

\[\ell(\mu, \sigma) = -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\]
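Before moving on, the product-to-sum identity itself is easy to check numerically; a minimal sketch with arbitrary values:

```r
# The log of a product equals the sum of the logs
x <- c(0.2, 0.5, 0.9)
log(prod(x))
sum(log(x))
isTRUE(all.equal(log(prod(x)), sum(log(x))))  # TRUE (up to floating point)
```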
The goal of MLE is to find the parameter values that maximize the log-likelihood function. In our case, we want to find the value of \(\mu\) that most likely produced the observed data. To do this, we’re going to manually iterate through different values of \(\mu\), calculate the log-likelihood at each step, save those log-likelihood values, and visualize the process by plotting the log-likelihoods against the trial values of \(\mu\).
Let’s implement this function in R. For simplicity, we’ll focus on estimating just the mean (\(\mu\)) and keep the standard deviation fixed at its true value:
Code
# Log-likelihood function for normal distribution
# Keeping sigma fixed at true value for simplicity
log_likelihood <- function(mu, data, sigma = true_sd) {
  n <- length(data)
  log_lik <- -n/2 * log(2 * pi) - n * log(sigma) -
    sum((data - mu)^2) / (2 * sigma^2)
  return(log_lik)
}
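As a sanity check on the closed-form expression, we can compare it with summing log-densities from R’s built-in `dnorm()`. A self-contained sketch under the same setup (seed 123, \(n = 20\), \(\sigma\) fixed at 2):

```r
# The closed-form log-likelihood should equal the sum of
# log-densities computed by dnorm(..., log = TRUE)
set.seed(123)
true_sd <- 2
data <- rnorm(20, mean = 10, sd = true_sd)

log_likelihood <- function(mu, data, sigma = true_sd) {
  n <- length(data)
  -n / 2 * log(2 * pi) - n * log(sigma) - sum((data - mu)^2) / (2 * sigma^2)
}

ll_formula <- log_likelihood(8, data)
ll_dnorm <- sum(dnorm(data, mean = 8, sd = true_sd, log = TRUE))
isTRUE(all.equal(ll_formula, ll_dnorm))  # TRUE
```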
Manual Step-by-Step Iteration
Now, let’s manually iterate through different values of \(\mu\) to find the one that maximizes the log-likelihood. We’ll start with a guess far from the actual mean and gradually move closer, documenting each step.
Let’s first create a function to visualize our current position on the log-likelihood curve:
Code
visualize_loglik <- function(mu_current, data, history = NULL, mu_range = c(8, 12)) {
  # Use the global bucolors variable
  # (avoids a recursive reference issue)

  # Generate a sequence for the x-axis (possible mu values)
  mu_values <- seq(mu_range[1], mu_range[2], by = 0.1)

  # Calculate log-likelihood for each mu value
  ll_values <- sapply(mu_values, log_likelihood, data = data)

  # Create data frame for plotting
  df_ll <- data.frame(mu = mu_values, log_likelihood = ll_values)

  # Create base plot
  p <- ggplot(df_ll, aes(x = mu, y = log_likelihood)) +
    geom_line(color = bucolors[[1]], linewidth = 1.2) +
    labs(
      title = "Log-Likelihood Function",
      subtitle = paste("Current guess: μ =", round(mu_current, 4)),
      x = "μ (Mean Parameter)",
      y = "Log-Likelihood"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold"),
      axis.title = element_text(face = "bold")
    )

  # Add vertical line for sample mean (true MLE)
  p <- p + geom_vline(xintercept = mean(data), color = bucolors[[5]],
                      linetype = "dashed", linewidth = 0.8)

  # Add current position
  current_ll <- log_likelihood(mu_current, data)
  p <- p + geom_point(
    data = data.frame(mu = mu_current, log_likelihood = current_ll),
    color = bucolors[[2]],
    size = 4
  )

  # Add history if provided
  if (!is.null(history) && nrow(history) > 0) {
    p <- p + geom_path(
      data = history,
      aes(x = mu, y = log_likelihood),
      color = bucolors[[3]],
      linetype = "dashed",
      linewidth = 1
    ) +
      geom_point(
        data = history,
        aes(x = mu, y = log_likelihood),
        color = bucolors[[3]],
        size = 3,
        alpha = 0.7
      )

    # Add iteration labels
    p <- p + geom_text(
      data = history,
      aes(x = mu, y = log_likelihood, label = iteration),
      hjust = -0.5,
      vjust = -0.5,
      color = bucolors[[5]],
      size = 4
    )
  }

  return(p)
}
Iteration 1: Initial Guess
Let’s start with an initial guess far from the true value:
Code
# Initial guess
mu_current <- 8.0

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Start our history data frame
history <- data.frame(
  iteration = 1,
  mu = mu_current,
  log_likelihood = ll_current
)

# Display current values
cat("Iteration 1:\n")
Iteration 1:
Code
cat(" μ =", mu_current, "\n")
μ = 8
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -54.2625
Code
cat(" Sample mean (target) =", mean(data), "\n\n")
Sample mean (target) = 10.28325
Code
# Visualize
visualize_loglik(mu_current, data)
Let’s examine in detail how we calculated the log-likelihood for our initial guess of μ = 8.0:
Code
# Calculate individual log-likelihood terms for each data point
individual_loglik <- data.frame(
  observation = data,
  term = -0.5 * log(2 * pi) - log(true_sd) -
    (data - mu_current)^2 / (2 * true_sd^2)
)

# Display the calculation
cat("Step 1: Calculate log-likelihood for each observation using μ =", mu_current, "\n\n")
Step 1: Calculate log-likelihood for each observation using μ = 8
Code
cat("\nStep 2: Sum all individual log-likelihood terms\n")
Step 2: Sum all individual log-likelihood terms
Code
cat("Sum =", sum(individual_loglik$term), "\n")
Sum = -54.2625
Code
cat("This matches our function output:", ll_current, "\n\n")
This matches our function output: -54.2625
Code
# Calculate a more human-readable version
cat("To understand this value, let's compare the probability densities:\n")
To understand this value, let's compare the probability densities:
Code
densities <- data.frame(
  observation = data,
  density_at_mu_8 = dnorm(data, mean = mu_current, sd = true_sd),
  density_at_sample_mean = dnorm(data, mean = mean(data), sd = true_sd)
)

# Ratio > 1 means the sample mean assigns higher density to that observation
densities$density_ratio <- densities$density_at_sample_mean / densities$density_at_mu_8

# Summarize the ratios
summary(densities$density_ratio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1729 0.9299 1.8720 3.3222 3.0700 12.5529
Code
cat("Values > 1 indicate the sample mean gives higher probability density\n")
Values > 1 indicate the sample mean gives higher probability density
Iteration 2: Move Toward the Center
Let’s try a value closer to the center of our data:
Code
# New guess
mu_current <- 9.0

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 2,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 2:\n")
Iteration 2:
Code
cat(" μ =", mu_current, "\n")
μ = 9
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -45.34626
Code
cat(" Improvement from previous:", ll_current - history$log_likelihood[1], "\n\n")
Improvement from previous: 8.916238
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Iteration 3: Getting Closer
The log-likelihood increased, so we’re moving in the right direction. Let’s try a higher value:
Code
# New guess
mu_current <- 9.5

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 3,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 3:\n")
Iteration 3:
Code
cat(" μ =", mu_current, "\n")
μ = 9.5
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -42.76315
Code
cat(" Improvement from previous:", ll_current - history$log_likelihood[2], "\n\n")
Improvement from previous: 2.583119
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Iteration 4: Even Closer
Let’s try an even higher value:
Code
# New guess
mu_current <- 10.0

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 4,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 4:\n")
Iteration 4:
Code
cat(" μ =", mu_current, "\n")
μ = 10
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -41.43003
Code
cat(" Improvement from previous:", ll_current - history$log_likelihood[3], "\n\n")
Improvement from previous: 1.333119
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Iteration 5: Test a Higher Value
Let’s see if going higher still improves the likelihood:
Code
# New guess
mu_current <- 10.5

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 5,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 5:\n")
Iteration 5:
Code
cat(" μ =", mu_current, "\n")
μ = 10.5
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -41.34691
Code
cat(" Improvement from previous:", ll_current - history$log_likelihood[4], "\n\n")
Improvement from previous: 0.08311901
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Iteration 6: Have We Gone Too Far?
The log-likelihood barely improved this time, which suggests we have overshot the peak (the maximum lies between 10.0 and 10.5). Let’s try a value between our last two guesses:
Code
# New guess - somewhere between the last two
mu_current <- 10.2

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 6,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 6:\n")
Iteration 6:
Code
cat(" μ =", mu_current, "\n")
μ = 10.2
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -41.24678
Code
cat(" Improvement from previous iteration 4:", ll_current - history$log_likelihood[4], "\n\n")
Improvement from previous iteration 4: 0.1832476
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Iteration 7: Fine-Tuning
Let’s try a slightly higher value to see if we can get even closer to the maximum:
Code
# New guess - a bit higher
mu_current <- 10.25

# Calculate log-likelihood at this point
ll_current <- log_likelihood(mu_current, data)

# Add to history
history <- rbind(history, data.frame(
  iteration = 7,
  mu = mu_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 7:\n")
Iteration 7:
Code
cat(" μ =", mu_current, "\n")
μ = 10.25
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -41.23222
Code
cat(" Improvement from previous:", ll_current - history$log_likelihood[6], "\n\n")
Improvement from previous: 0.0145619
Code
# Visualize with history
visualize_loglik(mu_current, data, history)
Comparing with the True MLE
For a normal distribution, we know that the maximum likelihood estimator for the mean is simply the sample mean - so the “true MLE” in this case is the sample mean of the variable. Let’s verify:
Code
# Calculate the sample mean
sample_mean <- mean(data)

# Calculate log-likelihood at the sample mean
ll_at_sample_mean <- log_likelihood(sample_mean, data)

# Add to our history
history <- rbind(history, data.frame(
  iteration = 8,
  mu = sample_mean,
  log_likelihood = ll_at_sample_mean
))

# Display values
cat("Sample Mean (True MLE):\n")
Sample Mean (True MLE):
Code
cat(" μ =", sample_mean, "\n")
μ = 10.28325
Code
cat(" Log-likelihood =", ll_at_sample_mean, "\n")
Log-likelihood = -41.22945
Code
cat(" Improvement from our best guess:", ll_at_sample_mean - max(history$log_likelihood[1:7]), "\n\n")
Improvement from our best guess: 0.002763508
Code
# Final visualization
visualize_loglik(sample_mean, data, history)
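The analytic result we just verified follows directly from calculus: differentiating the log-likelihood with respect to \(\mu\) and setting the derivative to zero gives

\[\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},\]

so the sample mean is exactly where the log-likelihood curve peaks.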
Summary of Iterations
Let’s summarize all our iterations to see how we’ve progressed:
Code
# Create a summary table
history$improvement <- c(NA, diff(history$log_likelihood))
history$distance_from_mle <- abs(history$mu - sample_mean)

# Format the table
kable(history,
      col.names = c("Iteration", "μ", "Log-Likelihood", "Improvement", "Distance from MLE"),
      digits = c(0, 4, 4, 4, 4),
      caption = "Summary of MLE Iterations")
Summary of MLE Iterations

| Iteration | μ       | Log-Likelihood | Improvement | Distance from MLE |
|-----------|---------|----------------|-------------|-------------------|
| 1         | 8.0000  | -54.2625       | NA          | 2.2832            |
| 2         | 9.0000  | -45.3463       | 8.9162      | 1.2832            |
| 3         | 9.5000  | -42.7631       | 2.5831      | 0.7832            |
| 4         | 10.0000 | -41.4300       | 1.3331      | 0.2832            |
| 5         | 10.5000 | -41.3469      | 0.0831      | 0.2168            |
| 6         | 10.2000 | -41.2468       | 0.1001      | 0.0832            |
| 7         | 10.2500 | -41.2322       | 0.0146      | 0.0332            |
| 8         | 10.2832 | -41.2295       | 0.0028      | 0.0000            |
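In practice, we would let an optimizer do this search for us. A sketch using base R’s `optimize()` under the same setup (seed 123, \(n = 20\), \(\sigma\) fixed at 2):

```r
# Automate the manual search: optimize() maximizes a one-parameter function
set.seed(123)
data <- rnorm(20, mean = 10, sd = 2)

log_likelihood <- function(mu, data, sigma = 2) {
  n <- length(data)
  -n / 2 * log(2 * pi) - n * log(sigma) - sum((data - mu)^2) / (2 * sigma^2)
}

fit <- optimize(log_likelihood, interval = c(8, 12), data = data, maximum = TRUE)
fit$maximum  # numerical MLE for mu
mean(data)   # analytic MLE; the two agree to optimizer tolerance
```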
Another Example: MLE for a Binary Variable
Let’s illustrate MLE with a different type of data: binary observations following a Bernoulli distribution.
Setup and Data Generation for Binary Variable
We’ll generate binary data (0s and 1s) with a true probability parameter p:
Code
# Set seed for reproducibility
set.seed(456)

# Generate a sample from a Bernoulli distribution
# True parameter: p = 0.7 (probability of success)
true_p <- 0.7
n_binary <- 50 # sample size
binary_data <- rbinom(n_binary, size = 1, prob = true_p)

# Display a summary of the data
table(binary_data)
binary_data
0 1
19 31
Code
mean(binary_data) # sample proportion of 1s
[1] 0.62
Let’s visualize the distribution of our binary data:
Code
# Create a data frame for plotting
binary_df <- data.frame(
  value = factor(binary_data, levels = c(0, 1),
                 labels = c("Failure (0)", "Success (1)")),
  count = 1
)

# Create a bar plot
ggplot(binary_df, aes(x = value, fill = value)) +
  geom_bar() +
  scale_fill_manual(values = c(bucolors[1], bucolors[2])) +
  labs(title = "Distribution of Binary Data",
       subtitle = paste("Sample proportion of successes =", round(mean(binary_data), 3)),
       x = "Outcome",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        legend.position = "none")
The Log-Likelihood Function for Binary Data
For binary data, we use the Bernoulli distribution with parameter \(p\) (the probability of success). A Bernoulli distribution is a special case of the binomial distribution with a single trial; because a dataset of \(n\) binary observations amounts to \(n\) such trials, binary models like the logit and probit belong to the binomial family.
The probability mass function (PMF) for a single observation \(x\) is:
\(f(x|p) = p^x \cdot (1-p)^{1-x}\)
Where x can be either 0 or 1. This formulation handles both cases:
When x = 1: f(1|p) = p
When x = 0: f(0|p) = 1-p
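We can confirm the two cases numerically (the value of \(p\) here is arbitrary):

```r
p <- 0.7
p^1 * (1 - p)^(1 - 1)  # x = 1: probability is p = 0.7
p^0 * (1 - p)^(1 - 0)  # x = 0: probability is 1 - p = 0.3
```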
For our sample \(x_1, x_2, \ldots, x_n\), assuming independent observations, the joint density function (likelihood) is:

\[L(p \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{s}(1-p)^{n-s}\]

where \(s = \sum_i x_i\) is the number of successes. Taking logs gives the log-likelihood \(\ell(p) = s\log(p) + (n-s)\log(1-p)\), which we implement below:
Code
# Log-likelihood function for Bernoulli distribution
binary_log_likelihood <- function(p, data) {
  s <- sum(data)    # number of successes
  n <- length(data) # total number of trials

  # Calculate log-likelihood
  log_lik <- s * log(p) + (n - s) * log(1 - p)
  return(log_lik)
}
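As with the normal case, we can sanity-check the simplified form against R’s built-in density function, `dbinom()` with `size = 1`; a self-contained sketch under the same seed-456 setup:

```r
# The simplified s*log(p) + (n-s)*log(1-p) form should match summing
# Bernoulli log-probabilities via dbinom(..., size = 1, log = TRUE)
set.seed(456)
binary_data <- rbinom(50, size = 1, prob = 0.7)

binary_log_likelihood <- function(p, data) {
  s <- sum(data)
  n <- length(data)
  s * log(p) + (n - s) * log(1 - p)
}

ll_formula <- binary_log_likelihood(0.3, binary_data)
ll_dbinom <- sum(dbinom(binary_data, size = 1, prob = 0.3, log = TRUE))
isTRUE(all.equal(ll_formula, ll_dbinom))  # TRUE
```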
Manual Iteration for Binary Data
Let’s find the maximum likelihood estimate for the probability parameter \(p\) using manual iteration, starting with a guess that’s far from the true value. The process is the same as what we did in the normal case above - we try different values for the parameter (\(p\)), calculate the log-likelihood at each step, and visualize the process by plotting the log-likelihoods against the trial values of \(p\).
Code
# Function to visualize log-likelihood for binary data
visualize_binary_loglik <- function(p_current, data, history = NULL) {
  # Generate a sequence of probability values
  p_values <- seq(0.1, 0.9, by = 0.01)

  # Calculate log-likelihood for each p value
  ll_values <- sapply(p_values, binary_log_likelihood, data = data)

  # Create data frame for plotting
  df_ll <- data.frame(p = p_values, log_likelihood = ll_values)

  # Create base plot
  p <- ggplot(df_ll, aes(x = p, y = log_likelihood)) +
    geom_line(color = bucolors[1], linewidth = 1.2) +
    labs(
      title = "Log-Likelihood Function (Bernoulli Distribution)",
      subtitle = paste("Current guess: p =", round(p_current, 4)),
      x = "p (Probability Parameter)",
      y = "Log-Likelihood"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold"),
      axis.title = element_text(face = "bold")
    )

  # Add vertical line for sample proportion (true MLE)
  p <- p + geom_vline(xintercept = mean(data), color = bucolors[5],
                      linetype = "dashed", linewidth = 0.8)

  # Add current position
  current_ll <- binary_log_likelihood(p_current, data)
  p <- p + geom_point(
    data = data.frame(p = p_current, log_likelihood = current_ll),
    color = bucolors[2],
    size = 4
  )

  # Add history if provided
  if (!is.null(history) && nrow(history) > 0) {
    p <- p + geom_path(
      data = history,
      aes(x = p, y = log_likelihood),
      color = bucolors[3],
      linetype = "dashed",
      linewidth = 1
    ) +
      geom_point(
        data = history,
        aes(x = p, y = log_likelihood),
        color = bucolors[3],
        size = 3,
        alpha = 0.7
      )

    # Add iteration labels
    p <- p + geom_text(
      data = history,
      aes(x = p, y = log_likelihood, label = iteration),
      hjust = -0.5,
      vjust = -0.5,
      color = bucolors[5],
      size = 4
    )
  }

  return(p)
}

# Initial guess
p_current <- 0.3

# Calculate log-likelihood at this point
ll_current <- binary_log_likelihood(p_current, binary_data)

# Start our history data frame
binary_history <- data.frame(
  iteration = 1,
  p = p_current,
  log_likelihood = ll_current
)

# Display current values
cat("Iteration 1:\n")
Let’s examine in detail how we calculated the log-likelihood for our initial guess of p = 0.3:
Code
# Calculate individual log-likelihood terms for each data point
binary_individual_loglik <- data.frame(
  observation = binary_data,
  term = ifelse(binary_data == 1,
                log(p_current),      # log(p) for successes
                log(1 - p_current))  # log(1-p) for failures
)

# Display the first few individual terms
cat("Step 1: Calculate log-likelihood for each observation using p =", p_current, "\n\n")
Step 1: Calculate log-likelihood for each observation using p = 0.3
Code
# New guess
p_current <- 0.5

# Calculate log-likelihood at this point
ll_current <- binary_log_likelihood(p_current, binary_data)

# Add to history
binary_history <- rbind(binary_history, data.frame(
  iteration = 2,
  p = p_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 2:\n")
Iteration 2:
Code
cat(" p =", p_current, "\n")
p = 0.5
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -34.65736
Code
cat(" Improvement from previous:", ll_current - binary_history$log_likelihood[1], "\n\n")
Improvement from previous: 9.442622
Code
# Visualize with history
visualize_binary_loglik(p_current, binary_data, binary_history)
Iteration 3: Getting Closer
Let’s move closer to the sample proportion:
Code
# New guess
p_current <- 0.6

# Calculate log-likelihood at this point
ll_current <- binary_log_likelihood(p_current, binary_data)

# Add to history
binary_history <- rbind(binary_history, data.frame(
  iteration = 3,
  p = p_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 3:\n")
Iteration 3:
Code
cat(" p =", p_current, "\n")
p = 0.6
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -33.24512
Code
cat(" Improvement from previous:", ll_current - binary_history$log_likelihood[2], "\n\n")
Improvement from previous: 1.412241
Code
# Visualize with history
visualize_binary_loglik(p_current, binary_data, binary_history)
Iteration 4: Very Close
Let’s try an even higher value:
Code
# New guess
p_current <- 0.68

# Calculate log-likelihood at this point
ll_current <- binary_log_likelihood(p_current, binary_data)

# Add to history
binary_history <- rbind(binary_history, data.frame(
  iteration = 4,
  p = p_current,
  log_likelihood = ll_current
))

# Display current values
cat("Iteration 4:\n")
Iteration 4:
Code
cat(" p =", p_current, "\n")
p = 0.68
Code
cat(" Log-likelihood =", ll_current, "\n")
Log-likelihood = -33.60479
Code
cat(" Improvement from previous:", ll_current - binary_history$log_likelihood[3], "\n\n")
Improvement from previous: -0.35967
Code
# Visualize with history
visualize_binary_loglik(p_current, binary_data, binary_history)
Comparing with the True MLE
For a Bernoulli distribution, we know that the maximum likelihood estimator for the probability p is simply the sample proportion of successes. Let’s verify:
Code
# Calculate the sample proportion (MLE)
p_mle <- mean(binary_data)

# Calculate log-likelihood at the MLE
ll_at_mle <- binary_log_likelihood(p_mle, binary_data)

# Add to our history
binary_history <- rbind(binary_history, data.frame(
  iteration = 5,
  p = p_mle,
  log_likelihood = ll_at_mle
))

# Display values
cat("Sample Proportion (True MLE):\n")
Sample Proportion (True MLE):
Code
cat(" p =", p_mle, "\n")
p = 0.62
Code
cat(" Log-likelihood =", ll_at_mle, "\n")
Log-likelihood = -33.20321
Code
cat(" Improvement from our best guess:", ll_at_mle - max(binary_history$log_likelihood[1:4]), "\n\n")
Improvement from our best guess: 0.04191191
Code
# Final visualization
visualize_binary_loglik(p_mle, binary_data, binary_history)
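The analytic result follows the same calculus argument as in the normal case: setting the derivative of \(\ell(p) = s\log(p) + (n-s)\log(1-p)\) to zero,

\[\frac{\partial \ell}{\partial p} = \frac{s}{p} - \frac{n-s}{1-p} = 0 \quad\Rightarrow\quad \hat{p} = \frac{s}{n},\]

that is, the number of successes divided by the number of trials, which is the sample proportion.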
Summary of Binary MLE Iterations
Let’s summarize the iterations for the binary example:
Code
# Create a summary table
binary_history$improvement <- c(NA, diff(binary_history$log_likelihood))
binary_history$distance_from_mle <- abs(binary_history$p - p_mle)

# Format the table
kable(binary_history,
      col.names = c("Iteration", "p", "Log-Likelihood", "Improvement", "Distance from MLE"),
      digits = c(0, 4, 4, 4, 4),
      caption = "Summary of MLE Iterations (Binary Data)")
Summary of MLE Iterations (Binary Data)

| Iteration | p    | Log-Likelihood | Improvement | Distance from MLE |
|-----------|------|----------------|-------------|-------------------|
| 1         | 0.30 | -44.1000       | NA          | 0.32              |
| 2         | 0.50 | -34.6574       | 9.4426      | 0.12              |
| 3         | 0.60 | -33.2451       | 1.4122      | 0.02              |
| 4         | 0.68 | -33.6048       | -0.3597     | 0.06              |
| 5         | 0.62 | -33.2032       | 0.4016      | 0.00              |
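The same automation works here; a sketch with `optimize()` under the seed-456 setup:

```r
# Let optimize() find the p that maximizes the Bernoulli log-likelihood
set.seed(456)
binary_data <- rbinom(50, size = 1, prob = 0.7)

binary_log_likelihood <- function(p, data) {
  s <- sum(data)
  n <- length(data)
  s * log(p) + (n - s) * log(1 - p)
}

fit <- optimize(binary_log_likelihood, interval = c(0.01, 0.99),
                data = binary_data, maximum = TRUE)
fit$maximum        # numerical MLE
mean(binary_data)  # sample proportion, the analytic MLE
```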
Conclusion
This step-by-step demonstration illustrates how Maximum Likelihood Estimation works through manual iteration. We’ve shown how to:
1. Start with an initial guess for the parameter (\(\mu\) or \(p\))
2. Evaluate the log-likelihood at different parameter values
3. Iteratively refine our guess to maximize the log-likelihood
4. Visualize the search process and our progress toward the maximum
For the normal distribution, we know analytically that the MLE for the mean is simply the sample mean (\(\hat{\mu} = \bar{x}\)), and for the Bernoulli distribution the MLE for \(p\) is the sample proportion; our manual searches confirmed both results.
In real-world applications with more complex models, the same principles apply, but statistical software uses optimization algorithms (often rooted in Newton’s method) to find the maximum likelihood estimates efficiently and accurately.
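As an illustration of that idea, here is a toy Newton’s-method loop for the normal-mean problem (an assumption-laden sketch: \(\sigma\) is fixed, so the score is \(\sum_i (x_i - \mu)/\sigma^2\) and the second derivative is the constant \(-n/\sigma^2\); real optimizers add step-size safeguards and convergence checks that this loop omits):

```r
# Toy Newton's method for the normal mean with sigma known
set.seed(123)
sigma <- 2
data <- rnorm(20, mean = 10, sd = sigma)

mu <- 8  # initial guess
for (i in 1:10) {
  score <- sum(data - mu) / sigma^2    # first derivative of the log-likelihood
  hessian <- -length(data) / sigma^2   # second derivative (constant here)
  mu <- mu - score / hessian           # Newton update
}

mu          # lands on the sample mean
mean(data)  # (the log-likelihood is quadratic in mu, so one step suffices)
```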